2023 Sep 7th – UQ PUG 1

Welcome to UQ Python User Group! Check out our general information for details about who we are and what we do.

Structure

We will start today by having everyone add their names to this page.
Add your questions to this page.
This Month’s Presentation.
Finally we will spend the rest of the session answering the questions you have brought!

This month’s presentation

Welcome to our first Python User Group gathering! This month, Luke and Cameron give an overview of the group, our vision and Noteable, the interactive collaborative notebook platform which can run markdown, Python, R and SQL.

Introduce yourself

What’s your name?	Where are you from?	Why are you here?
Luke	Library	Learn
Nick	Library	Learn
Valentina Urrutia Guada	Library	Learn
Nida	School of Molecular Biology	(1) Understand the logic of some code for bioinformatics and metagenomics applications, (2) Learn how to visualize data in Python that is hard to do in Excel
Cameron	Library	Learn
Sam Hames	School of Languages and Cultures	Community
Nikhil	School of EECS	Learn
Paul Vrbik	EECS	Support
Jason Dail	SENV	Learn
Annie Nguyen	SENV	Learn :)

Research tools

Here are a few links we shared around, mostly from Jason.

to assist with academic citations
https://researchrabbitapp.com/home helpful when making research collections, mapping concepts, and looking at the linkages between references
https://consensus.app/
https://chat.openai.com/
https://article-summarizer.scholarcy.com/summarizer
https://typeset.io/ AI assistant for reading and understanding papers
https://www.listendata.com/2023/03/how-to-run-chatgpt-inside-excel.html (how-to for an Excel extension that runs ChatGPT)
EECS tutor help: https://eecs.uq.edu.au/current-students/eecs-learning-centre-tutors

Questions

If you have any Python questions you’d like to explore with the group, please put them in a markdown cell, with any code you’d like us to run in a Python cell.

Question 1 - Finding substrings for COVID sequencing (Nida) - Note that the formatting has not transferred correctly here

Nida has a problem where she needs to identify specific substrings in a large sequence of characters (a DNA sequence). Her question and code are below.

I just found out that the line

    covid_seq = getSequence('MN908947', 'genbank', DNA_Alphabet)

fetches the COVID DNA sequence from GenBank and gives as output a string of 29,900 characters (which we can see by opening the EBI URL below).

It uses a function called "getSequence", defined as:

    def getSequence(entryId, dbName = 'uniprotkb', alphabet = Protein_Alphabet, format = 'fasta', debug: bool = True):
        if not isinstance(entryId, str):
            entryId = entryId.decode("utf-8")
        url ='http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?style=raw&db=' + dbName + '&format=' + format + '&id=' + entryId
        try:
            if debug:
                print('DEBUG: Querying URL: {0}'.format(url))
            data = urllib.request.urlopen(url).read()
            if format == 'fasta':
                return readFastaString(data.decode("utf-8"), alphabet)[0]
            else:
                return data.decode("utf-8")
        except urllib.error.HTTPError as ex:
            raise RuntimeError(ex.read())

This function retrieves a single entry from a database (entryId: ID for the entry, e.g. 'MN908947'; dbName: name of the database, e.g. 'genbank').

Once we have that DNA string (consisting of only 4 types of characters: A/C/T/G), a biologist will translate it into an amino acid (protein) string, consisting of 20 types of characters plus the asterisk * (please see https://www.hgvs.org/mutnomen/codon.html). We translate it using a dictionary (code not shown) and then split the result on * to produce a list of strings stored in a variable called "protseq" (see https://docs.google.com/document/d/1R22IGMfe9i1tYAlPK5ZSOikV-ON6xZVUq-C87xDMiLg/edit?usp=sharing ).
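Since the translation code is not shown, here is a minimal sketch of the translate-then-split step. It is a stand-in, not the course module's `translateDNA`: the names `CODON_TABLE`, `translate` and `toy_dna` are made up for illustration, and the table is the standard genetic code written out in TCAG order.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA string codon by codon, starting at offset `frame`."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical toy sequence: ATG GCC TAA ATG AAA TGA
toy_dna = "ATGGCCTAAATGAAATGA"
protein = translate(toy_dna)
pieces = protein.split("*")

print(protein)  # MA*MK*
print(pieces)   # ['MA', 'MK', '']
```

Note that `split("*")` removes the asterisks, so the strings in the list no longer end with `*`. Every element except the last one was followed by a stop codon in the translated string.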

Finally, from that I need to find how many strings in "protseq" meet these criteria: 1) start with M, 2) end with *, 3) have a length of >= 100 characters.

So my question is actually: how do I understand the logic behind the code below, which is said to be able to do that job?

    # check the first occurrence of M in the first string of the 'protseq' list
    # (note: find() returns -1 when the string contains no 'M')
    where_M_in_protseq_1st_string = protseq[0].find('M')
    print(where_M_in_protseq_1st_string)
    print(len(protseq[0]))

    # count the strings in 'protseq' whose part from the first M onward is >= 100 long
    cnt = 0
    for i in protseq:
        if len(i) - i.find("M") >= 100:
            cnt += 1
    print(cnt)

    # print each part that starts at the first M and is longer than 100
    for seq in protseq:
        m_pos = seq.find("M")
        m_end_seq = seq[m_pos:]
        if len(m_end_seq) > 100:
            print(m_end_seq)
            print(len(m_end_seq))
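To unpack the logic of the snippet above: `find("M")` gives the index of the first M, slicing from that index keeps everything from that M to the end, and the length of that slice is what gets compared to the threshold. The sketch below shows this on a made-up toy list with a small threshold. The names `count_long_orfs`, `naive_count` and `toy` are illustrative only. It also shows one thing to watch for: `find` returns -1 when there is no M, which the original arithmetic does not guard against.

```python
def count_long_orfs(pieces, min_len=100):
    """Count pieces whose part from the first 'M' onward has >= min_len characters."""
    count = 0
    for piece in pieces:
        m_pos = piece.find("M")    # index of the first 'M', or -1 if there is none
        if m_pos == -1:
            continue               # no start codon in this piece, skip it
        candidate = piece[m_pos:]  # everything from the first 'M' to the end
        if len(candidate) >= min_len:
            count += 1
    return count


def naive_count(pieces, min_len=100):
    """Same arithmetic as the snippet in the question, without the -1 guard."""
    return sum(1 for piece in pieces if len(piece) - piece.find("M") >= min_len)


toy = ["AAMKK", "MKKKKK", "KKKKKKK", "KM"]

print("AAMKK".find("M"))    # 2
print("AAMKK"[2:])          # MKK
print("KKKKKKK".find("M"))  # -1

print(count_long_orfs(toy, min_len=3))  # 2
print(naive_count(toy, min_len=3))      # 3, because "KKKKKKK" is counted by mistake
```

If the two counts differ on the real `protseq` list, the difference comes from long strings that contain no M at all.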

The expected answer is that 8 of the 18 strings in the protseq list meet these criteria (so if the code gives us 8, we are correct). I just hope that I can get it right and understand how the code works.

Importance: we can look up one of those 8 strings in a real protein database (called UniProtKB) to find out which part of the COVID virus is likely to interact with humans and cause disease.

Thank you very much

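import urllib.request   # needed for urllib.request.urlopen
import urllib.error     # needed for urllib.error.HTTPError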

class Alphabet():
    """ A minimal class for alphabets
        Alphabets include DNA, RNA and Protein """
    def __init__(self, symbolString):
        self.symbols = symbolString
    def __len__(self):              # implements the "len" operator, e.g. "len(Alphabet('XYZ'))" results in 3
        return len(self.symbols)    # will tell you the length of the symbols in an Alphabet instance
    def __contains__(self, sym):    # implements the "in" operator, e.g. "'A' in Alphabet('ACGT')" results in True
        return sym in self.symbols  # will tell you if 'A' is in the symbols in an Alphabet instance
    def __iter__(self):             # method that allows us to iterate over all symbols, e.g. "for sym in Alphabet('ACGT'): print(sym)" prints A, C, G and T on separate lines
        tsyms = tuple(self.symbols)
        return tsyms.__iter__()
    def __getitem__(self, ndx):
        """ Retrieve the symbol(s) at the specified index (or slice of indices) """
        return self.symbols[ndx]
    def index(self, sym):
        """ Retrieve the index of the given symbol in the alphabet. """
        return self.symbols.index(sym)
    def __str__(self):
        return self.symbols

""" Below we declare alphabet variables that are going to be available when
this module (this .py file) is imported """
DNA_Alphabet = Alphabet('ACGT')
RNA_Alphabet = Alphabet('ACGU')
Protein_Alphabet = Alphabet('ACDEFGHIKLMNPQRSTVWY')
Protein_wX = Alphabet('ACDEFGHIKLMNPQRSTVWYX')
Protein_wGAP = Alphabet('ACDEFGHIKLMNPQRSTVWY-')


def getSequence(entryId, dbName = 'uniprotkb', alphabet = Protein_Alphabet, format = 'fasta', debug: bool = True):
    """ Retrieve a single entry from a database
    entryId: ID for entry e.g. 'P63166' or 'SUMO1_MOUSE'
    dbName: name of database e.g. 'uniprotkb' or 'pdb' or 'refseqn'; see http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases for available databases
    format: file format specific to database e.g. 'fasta' or 'uniprot' for uniprotkb (see http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/dbfetch.databases)
    See http://www.ebi.ac.uk/Tools/dbfetch/syntax.jsp for more info re URL syntax
    """
    if not isinstance(entryId, str):
        entryId = entryId.decode("utf-8")
    url ='http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?style=raw&db=' + dbName + '&format=' + format + '&id=' + entryId
    try:
        if debug:
            print('DEBUG: Querying URL: {0}'.format(url))
        data = urllib.request.urlopen(url).read()
        if format == 'fasta':
            return readFastaString(data.decode("utf-8"), alphabet)[0]
        else:
            return data.decode("utf-8")
    except urllib.error.HTTPError as ex:
        raise RuntimeError(ex.read())


# get the covid 19 genome (29kB)
covid_seq = getSequence('MN908947', 'genbank', DNA_Alphabet)
# print(seq_no10)

# get all the bases from covid_seq
# translate all of those base into amino acid seq
# in all reading frames (6)





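tr_f = [0, 1, 2]
# translate protein in all six reading frames (3 forward, 3 reverse)
protseq = []   # created once, before the loop, so every frame is kept

for i in tr_f:
    #print("all ORF in fwd direction", covid_seq.translateDNA(i, True))
    seq10b = covid_seq.translateDNA(i, True)
    protseq.extend(seq10b.split("*"))    # extend rather than assign, otherwise earlier frames are overwritten
    seq10br = covid_seq.translateDNA(i, False)
    protseq.extend(seq10br.split("*"))
print(str(protseq))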
#     print(protseq)
#     for element in protseq:
#         print("Individual value is",element)
# #for i in tr_f:
#     print("all ORF in reverse direction", covid_seq.translateDNA(i, False))
    
#NEXT STEP (for mapping the ORF that begins with M & calculate the len > 100)
ORF = []
for i in protseq:
    if i.startswith('M'):
        ORF.append(i)
print('all of potential ORF', ORF)

print('length of all potential ORF')
for i in ORF:
    print(i, ":", len(i))

true_ORF = []
for i in ORF:
    if len(i) > 100:
        true_ORF.append(i)
print(true_ORF)

# for i in protseq:
#     if i == 'M': # and len(i) > 100:
#         ORF.append(i)
# print(ORF)
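One thing worth comparing: the cell above keeps only strings that start with M, while the snippet in the question looks for the first M anywhere in the string. The two can give different answers when an M sits a few characters after the previous stop. A small sketch on made-up data (the list `pieces` and the threshold of 5 are for illustration only):

```python
pieces = ["MAAAA", "KKMAAAA", "KKKKKKK", "MA"]
min_len = 5

# Approach 1: the whole string must begin with 'M'.
starts_with_m = [p for p in pieces if p.startswith("M") and len(p) >= min_len]

# Approach 2: take the part from the first 'M' onward, wherever it occurs.
from_first_m = [
    p[p.find("M"):]
    for p in pieces
    if "M" in p and len(p) - p.find("M") >= min_len
]

print(starts_with_m)  # ['MAAAA']
print(from_first_m)   # ['MAAAA', 'MAAAA']
```

Which approach is right depends on the biological definition being used, so it is worth checking both against the expected count.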

characters = "asjidowgeriogpjicnjjlaksdjalksdj*alskjdjjjjeosj   hjjjjl"

# We want to pick out "jid" and "jic"
ans = []
for k, c in enumerate(characters):
    if c == "j":   # add an arbitrary number of constraints here
        ans.append(characters[k:k+3])
    if c == "*":
        break

if ans and len(ans[-1]) < 3:   # drop a trailing short match; "ans and" guards against an empty list
    ans.pop()
        
print(ans)
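The loop above can also be wrapped in a small function so it is easy to try on different strings. This is only a sketch: the function name `windows_starting_with` and its parameters are made up here, and it drops every match that is too short rather than only the last one.

```python
def windows_starting_with(text, start="j", width=3, stop="*"):
    """Collect `width`-character substrings that begin with `start`.

    Scanning stops at the first `stop` character. Matches shorter than
    `width` (because the text ran out) are left out.
    """
    found = []
    for k, c in enumerate(text):
        if c == stop:
            break
        if c == start:
            window = text[k:k + width]
            if len(window) == width:
                found.append(window)
    return found


print(windows_starting_with("asjidowgeriogpjicneosjl"))  # ['jid', 'jic']
print(windows_starting_with("abc"))                      # []
```

The final "jl" in the first string is only two characters long, so it is skipped, and a string with no "j" at all simply gives an empty list.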

characters = "asjidowgeriogpjicneosjl"